Skip to content

hotfix(events): long-poll releases pool slot via commit between iterations (P0)#100

Merged
mikemolinet merged 1 commit into
mainfrom
primary/p0-long-poll-pool-exhaustion
May 22, 2026
Merged

hotfix(events): long-poll releases pool slot via commit between iterations (P0)#100
mikemolinet merged 1 commit into
mainfrom
primary/p0-long-poll-pool-exhaustion

Conversation

@mikemolinet
Copy link
Copy Markdown
Collaborator

🚨 P0 HOTFIX — long-poll holds DB txn → pool exhaustion

Cross-port from cueapi/cueapi#925 (private, mergeCommit 359961b).

Incident: a downstream consumer (cue.dock.svc) experienced pool exhaustion in prod due to _run_long_poll_wait holding an idle in transaction connection for the full ~30s wait window.

Mechanism

At ~15 concurrent long-pollers (SQLAlchemy default pool_size=5 + max_overflow=10 = 15):

GET /v1/agents/{ref}/events?wait=long appears to BEGIN a transaction,
SELECT events, and then HOLD that transaction (and its connection) 
open for the full ~30s wait when there are no new events.

pg_stat_activity:
- 15 connections stuck in `idle in transaction`
- Each holding `SELECT events.id, events.event_type ...` for 14-20+ seconds
- Matches SQLAlchemy default pool ceiling
- App-side pool 100% saturated by waiting long-polls

Root cause: _run_long_poll_wait while loop runs await asyncio.sleep + await pull_events(db, ...) with no commit between iterations. Each pull_events opens an implicit txn that the session holds for the full ~30s wait.

Fix (+2 LOC)

  1. await db.commit() at function entry — releases caller's initial-pull implicit txn before first sleep window
  2. await db.commit() after each empty pull_events() iteration — releases per-iteration implicit txn before next sleep window

Empirical verification (load-bearing)

The fix relies on AsyncSession.commit() returning the connection to the pool. Verified empirically against real Postgres + asyncpg + SQLAlchemy 2.0.35 via engine.pool.checkedin()/checkedout() state inspection:

After SELECT 1: checked_in=0, checked_out=1
After commit:   checked_in=1, checked_out=0   ← RETURNED to pool ✓
After sleep:    checked_in=1, checked_out=0   ← stays in pool
After SELECT 2: checked_in=0, checked_out=1   ← re-acquired transparently

Session remains usable across commits; both Postgres idle in transaction AND SQLAlchemy pool slot exhaustion are addressed.

Tests

  • 2 new regression-guard tests added pinning the commit invariant:
    • test_helper_commits_at_entry_and_between_empty_iterations
    • test_helper_commits_at_entry_even_when_event_arrives_immediately
  • 13/13 long-poll tests pass locally

Tag plan

Suggesting messaging-v1.1.5-hotfix tag on merge. Single-purpose hotfix tag for downstream consumers to pin-bump on.

Parity reference

Same diff shape as private cueapi/cueapi#925 (mergeCommit 359961b). Per CLAUDE.md OSS-vs-private parity policy — app/routers/events.py:_run_long_poll_wait shares verbatim across both repos.

🤖 Generated with Claude Code

…tions (P0)

P0 hotfix port: a downstream consumer (cue.dock.svc) experienced pool
exhaustion in prod due to `_run_long_poll_wait` holding an idle-in-
transaction connection for the full ~30s wait window. At ~15 concurrent
long-pollers (SQLAlchemy default pool_size=5 + max_overflow=10 = 15)
the entire app-side pool gets saturated; new requests time out waiting
for a free connection.

Root cause: the while loop runs `await asyncio.sleep` + `await
pull_events(db, ...)` with NO commit between iterations. Each
pull_events opens an implicit txn that the session holds for the full
wait window.

Fix (+2 LOC): `await db.commit()` at function entry (releases caller's
initial-pull implicit txn before first sleep window) + after each
empty iteration (releases per-iteration txn before next sleep window).

Empirically verified (2026-05-22) that AsyncSession.commit() returns
the pool slot AND ends the Postgres txn; session re-acquires from pool
transparently on next operation.

Tests:
- 2 new regression guards added to tests/test_events_long_poll.py
  pinning the commit invariant (commit_count assertions)
- 13/13 long-poll tests pass locally

Cross-port from cueapi/cueapi#925 (mergeCommit 359961b).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

Parity check

This PR modifies files tracked in parity-manifest.json:

  • app/routers/events.py

Please confirm one of the following in a reply or PR description update:

  1. The equivalent change has been applied to the private cueapi monorepo. Link the PR.
  2. This change is OSS-only and does not need porting. Briefly explain why (e.g. "fixes a bug that only exists in the OSS build").
  3. A follow-up issue has been filed to port the reverse direction. Link the issue.

This is a soft check — it does not block merge. The goal is visibility, not friction. See HOSTED_ONLY.md for the open-core policy.

@govindkavaturi-art govindkavaturi-art enabled auto-merge (squash) May 22, 2026 02:36
@mikemolinet mikemolinet merged commit 5ff8989 into main May 22, 2026
6 checks passed
@mikemolinet mikemolinet deleted the primary/p0-long-poll-pool-exhaustion branch May 22, 2026 02:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant